Implement column-level PII protection for sample collection#832
Implement column-level PII protection for sample collection#832devin-ai-integration[bot] wants to merge 1 commit into
Conversation
- Add disable_samples_on_pii_columns and pii_column_tags config variables - Create get_pii_columns_from_parent_model helper macro for column PII detection - Modify query_test_result_rows to dynamically exclude PII columns from SELECT - Add comprehensive integration tests for column-level PII protection - Maintain backward compatibility with existing table-level PII detection - Follow existing patterns for configuration, tag handling, and testing Addresses Linear issue ELE-4850 Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
WalkthroughThe changes introduce column-level PII protection by adding configuration variables, a macro to detect PII columns based on tags, and logic to exclude such columns from test result samples. Integration tests are added to verify correct behavior when PII sampling is enabled or disabled and when all columns are PII-tagged. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant dbt_test
participant Macro_handle_dbt_test
participant Macro_query_test_result_rows
participant Macro_get_pii_columns_from_parent_model
participant Database
User->>dbt_test: Run test with config (PII sampling enabled/disabled)
dbt_test->>Macro_handle_dbt_test: Execute test logic
Macro_handle_dbt_test->>Macro_query_test_result_rows: Request sample rows (with flattened_test)
Macro_query_test_result_rows->>Macro_get_pii_columns_from_parent_model: Identify PII columns
Macro_get_pii_columns_from_parent_model-->>Macro_query_test_result_rows: Return list of PII columns
Macro_query_test_result_rows->>Database: Query test results (excluding PII columns)
Database-->>Macro_query_test_result_rows: Return sample rows
Macro_query_test_result_rows-->>Macro_handle_dbt_test: Return filtered samples
Macro_handle_dbt_test-->>dbt_test: Return test result with samples
dbt_test-->>User: Show test outcome and samples
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Assessment against linked issues
Poem
✨ Finishing Touches
🧪 Generate unit tests
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
|
👋 @devin-ai-integration[bot] |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
macros/edr/system/system_utils/is_pii_column.sql (1)
25-38: Consider optimizing the nested loop for better performance.The current implementation uses nested loops which could be optimized for better performance, especially with models having many columns and PII tags.
Consider using set operations for more efficient tag matching:
{% for column_node in column_nodes.values() %} {% set config_dict = column_node.get('config', {}) %} {% set config_tags = config_dict.get('tags', []) %} {% set global_tags = column_node.get('tags', []) %} {% set meta_dict = column_node.get('meta', {}) %} {% set meta_tags = meta_dict.get('tags', []) %} {% set all_column_tags = config_tags + global_tags + meta_tags %} - {% for pii_tag in pii_column_tags %} - {% if pii_tag in all_column_tags %} - {% do pii_columns.append(column_node.get('name')) %} - {% break %} - {% endif %} - {% endfor %} + {% if all_column_tags | intersect(pii_column_tags) | length > 0 %} + {% do pii_columns.append(column_node.get('name')) %} + {% endif %} {% endfor %}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
integration_tests/tests/test_column_pii_sampling.py(1 hunks)macros/edr/materializations/test/test.sql(2 hunks)macros/edr/system/system_utils/get_config_var.sql(1 hunks)macros/edr/system/system_utils/is_pii_column.sql(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
macros/edr/materializations/test/test.sql (3)
Learnt from: haritamar
PR: #825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.
Learnt from: haritamar
PR: #827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
Learnt from: haritamar
PR: #825
File: macros/utils/run_queries/agate_to_dict.sql:1-7
Timestamp: 2025-07-15T19:28:35.548Z
Learning: In dbt-data-reliability, the agate_to_dict macro in macros/utils/run_queries/agate_to_dict.sql intentionally returns different types based on fusion/non-fusion mode to handle breaking changes in dbt Fusion. In fusion mode, it returns the original agate_table, while in non-fusion mode it returns agate_table.to_dict(). This is because the content/structure of agate_table is different between the two modes.
🔇 Additional comments (7)
macros/edr/system/system_utils/get_config_var.sql (1)
58-59: LGTM! Configuration variables follow established patterns.The new configuration variables are well-named, have sensible defaults, and maintain backward compatibility by being disabled by default. The default PII tags cover common naming conventions.
macros/edr/system/system_utils/is_pii_column.sql (1)
1-42: Well-structured macro with comprehensive PII detection logic.The macro follows good defensive programming practices with early returns and null checks. The tag collection from multiple sources (config.tags, tags, meta.tags) ensures comprehensive coverage of dbt's column metadata structure.
macros/edr/materializations/test/test.sql (3)
54-55: LGTM! Parameter passing enables PII detection.The modification correctly passes the
flattened_testparameter to enable PII column detection in the query function.
107-137: Solid implementation of PII-aware column filtering.The logic correctly:
- Detects PII columns using the new macro
- Dynamically determines available columns
- Excludes PII columns from selection
- Handles the edge case where all columns are PII with a meaningful placeholder
123-130: No changes needed for column detectionAfter searching the codebase:
- The only use of
LIMIT 0is inmacros/edr/materializations/test/test.sql(lines 123–130).- All column metadata comes through
elementary.run_query, which wrapsdbt.run_queryand correctly returns column names across every supported adapter.- There are no adapter-specific overrides or alternate column-detection patterns in the repository.
You can safely rely on this generic approach.
integration_tests/tests/test_column_pii_sampling.py (2)
1-133: Excellent comprehensive test coverage for PII sampling functionality.The integration tests provide thorough coverage of the new feature including:
- Main functionality with PII filtering enabled
- Backward compatibility with feature disabled
- Edge case handling when all columns are PII
The test structure, data setup, and assertions are well-designed and should catch any regressions.
120-120: Ignore the all-PII sampling status concern
Thenot_nullconstraint is applied toSAFE_COLUMN, which contains only non-null integer values, so expecting"pass"is correct. Thedisable_samples_on_pii_columnsflag only suppresses sample row materialization—it does not change the underlying null-check logic. No changes needed.Likely an incorrect or invalid review comment.
| TEST_SAMPLE_ROW_COUNT = 5 | ||
|
|
||
|
|
||
| @pytest.mark.skip_targets(["clickhouse"]) |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Verify test compatibility with skipped database adapter.
The tests are marked to skip on ClickHouse. Ensure this is necessary and document why this adapter is excluded.
Let me check if there are adapter-specific issues or if this is a broader pattern:
🏁 Script executed:
#!/bin/bash
# Description: Check for ClickHouse-related skip patterns and adapter compatibility
echo "Searching for other ClickHouse skip patterns..."
rg -A 2 -B 2 "skip.*clickhouse" integration_tests/
echo -e "\nChecking for adapter-specific test issues..."
rg -A 5 -B 5 "clickhouse" integration_tests/ | head -20
echo -e "\nLooking for general adapter compatibility patterns..."
rg "skip_targets" integration_tests/ | head -10Length of output: 38846
Document reason for skipping PII sampling on ClickHouse
Please add a comment above the skip decorator to explain why column-level PII sampling tests aren’t supported by the ClickHouse adapter, mirroring the pattern in other tests.
• integration_tests/tests/test_column_pii_sampling.py (around line 26)
Example:
# Column-level PII sampling is not supported on ClickHouse
@pytest.mark.skip_targets(["clickhouse"])🤖 Prompt for AI Agents
In integration_tests/tests/test_column_pii_sampling.py at line 26, add a comment
above the @pytest.mark.skip_targets(["clickhouse"]) decorator explaining that
column-level PII sampling is not supported on ClickHouse. This comment should
clearly state the reason for skipping the test on ClickHouse, following the
existing pattern in other tests.
|
created #833 instead |
Implement column-level PII protection for sample collection
Summary
This PR extends the existing PII protection feature to work at the column level instead of just the table level. Previously, if a table was tagged as PII, no samples were collected from failing rows. Now users can tag specific columns as PII, and only those columns will be excluded from sample data while preserving non-sensitive columns for debugging.
Key Changes:
disable_samples_on_pii_columns(boolean, default: false) andpii_column_tags(list, default: ['pii', 'personal', 'sensitive'])get_pii_columns_from_parent_modelmacro to detect PII columns by checking column tags against configured PII tagsquery_test_result_rowsto dynamically exclude PII columns from SELECT statements instead of usingSELECT *Review & Testing Checklist for Human
config: {tags: ['pii']}and verify that test samples exclude those columns while including non-PII columnsdisable_samples_on_pii_columns: false(the default)pii_column_tagsworks with both string and list inputs, and that boolean config parsing works correctlyRecommended Test Plan:
disable_samples_on_pii_columns: truecd integration_tests/tests && pytest test_column_pii_sampling.py -vvvDiagram
%%{ init : { "theme" : "default" }}%% graph TD A["get_config_var.sql<br/>(config variables)"]:::minor-edit --> B["query_test_result_rows<br/>(sample collection)"]:::major-edit C["is_pii_column.sql<br/>(PII detection)"]:::major-edit --> B D["handle_dbt_test<br/>(test materialization)"]:::minor-edit --> B E["test_column_pii_sampling.py<br/>(integration tests)"]:::major-edit F["dbt model columns<br/>(with PII tags)"]:::context --> C B --> G["test_result_rows table<br/>(filtered samples)"]:::context subgraph Legend L1[Major Edit]:::major-edit L2[Minor Edit]:::minor-edit L3[Context/No Edit]:::context end classDef major-edit fill:#90EE90 classDef minor-edit fill:#87CEEB classDef context fill:#FFFFFFNotes
_no_non_pii_columnsto avoid SQL errorsSession Details:
Summary by CodeRabbit